The population database contains counts for six age groups, whereas the California and LA County databases use four. Because the groupings align — “0–4,” “5–11,” and “12–17” in the population database correspond to the “0–17” group in the others — we summarized the population counts to match the four age-group format.
In the original population database, counts are reported for each single year of age within every demographic subgroup (e.g., county, sex, race/ethnicity). To obtain total estimates per age group and demographic category, the single-year counts were aggregated (summed). Table 1 below presents a subset of the original and restructured population data to illustrate the resulting summarized age groupings.
3.0.2 Table 1: Reconciling Age Group Categories and Population Counts
Original Age Groups:
County
Sex
Race / Ethnicity
Age Group
Population
Alameda
FEMALE
Hispanic
0-4
11587
Alameda
FEMALE
Hispanic
12-17
17098
Alameda
FEMALE
Hispanic
5-11
17761
Total
-
-
-
46,446
New, Aggregate Age Group:
County
Sex
Race / Ethnicity
Age Group
Population
Alameda
FEMALE
Hispanic
0-17
46,446
3.0.3 Infection Dates
To support date-based analyses, infection records will be aggregated by MMWR week and year. For both the Los Angeles County and California datasets, we will generate two new columns: mmwr_year and mmwr_week, then remove the original date-based fields. We will also add columns start_date and end_date to serve as reference points should we need them later.
In the California dataset, the field time_int encodes the year and MMWR week as a six-digit integer (YYYYWW). To create the new fields, we extract the first four digits as mmwr_year and the last two digits as mmwr_week, then drop the original time_int column.
In the Los Angeles County dataset, the codebook identifies a field dt_report as the last day of the MMWR week. However, this field contained only missing values, so it was removed. Instead, we convert the infection date field, dt_dx to a proper date format, and then use the MMWRweek package to derive the mmwr_year and mmwr_week.
Code
##-- California dataset:step2_ca_df <- step1_ca_df %>%##--pull MMWR week and year from time_int fieldmutate(mmwr_year =factor(time_int %/%100), mmwr_week =factor(time_int %%100)) %>%add_start_end_dates() %>%select(-time_int) %>%relocate(mmwr_year, mmwr_week, start_date, end_date, .before =everything())##-- LA county dataset:step2_la_cnty_df <- step1_la_cnty_df %>%##--restructure to proper date formatmutate(DATE_FIX =as.Date(parse_date_time(dt_dx, "%d%b%Y"), format ="%Y-%m-%d")) %>%##--use date to create new MMWR fieldsadd_mmwr_week_columns(date_col ="DATE_FIX") %>%add_start_end_dates() %>%select(-c(DATE_FIX, dt_dx)) %>%relocate(mmwr_year, mmwr_week, start_date, end_date, .before =everything()) %>%relocate(county, .before = age_cat)
To streamline this process, we created two helper functions:
add_mmwr_week_columns() : takes date column and adds two fields: mmwr_year and mmwr_week
add_start_end_dates() : uses those values to generate corresponding MMWR week start and end dates
The dataframes now have a structure that looks like this:
mmwr_year
mmwr_week
start_date
end_date
county
age_cat
new_infections
2023
22
2023-05-28
2023-06-03
Los Angeles
0-17
15
2023
23
2023-06-04
2023-06-10
Los Angeles
0-17
17
2023
24
2023-06-11
2023-06-17
Los Angeles
0-17
23
3.1 Race and Ethnicity:
Each of the three datasets defines Race / Ethnicity differently. The California dataset uses numeric codes, the Los Angeles County dataset uses full text labels, and the population dataset uses abbreviated text.
To resolve this, we created a crosswalk file (race_ethnicity_map.csv) that aligns the three formats. By joining this crosswalk to each dataset, we ensure that all three contain a consistent set of race and ethnicity variables: each with the numeric code, the abbreviated text, and the full text label.
Multiracial (two or more of above races), Non-Hispanic
MR NH
7
Hispanic (any race)
Hispanic
9
Unknown
Unknown
3.2 Population Counts by Demographic
To calculate infection rates by demographic groups (such as county, health officer region, sex, or race/ethnicity), we first summarize total population counts for each demographic category within the population dataframe. The population dataset is then joined to both the master database (which merges all three datasets) and the individual California and Los Angeles datasets.
By creating and maintaining these summarized population counts, we avoid having to recalculate them each time we focus on a different demographic group. For example, once this population dataset is joined to the California data, we can easily calculate population-adjusted infection rates that allow valid comparisons across counties with differing population sizes.
Code
step2_pop_df <- step2_pop_df_recat %>%##-- join population database to the race/ethnicity mapmutate(race_short =clean(race_ethnicity)) %>%select(-race_ethnicity) %>%left_join(race_ethnicity_map, by ="race_short") %>%relocate(race_coded, race_short, race_long, .after = sex) %>%##-- calculate population totals by demographicgroup_by(county, health_officer_region) %>%mutate(total_cnty_pop =sum(pop)) %>%ungroup() %>%group_by(county, health_officer_region, race_coded, race_short, race_long) %>%mutate(total_race_pop =sum(pop)) %>%ungroup() %>%group_by(county, health_officer_region, sex) %>%mutate(total_sex_pop =sum(pop)) %>%ungroup() %>%group_by(health_officer_region) %>%mutate(total_HOR_pop =sum(pop)) %>%ungroup()
3.2.1 Table 2: Calculating Infection Rate by 100,000 Population Example